-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cub::DeviceTransform
#2086
Add cub::DeviceTransform
#2086
Conversation
c48b0e8
to
02c1aff
Compare
445a3aa
to
84bb0ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments. Regarding the address stability detection, I think it's a good idea to separate it into a different PR. There's no risk in merging only opt-in mechanism for Thrust.
5ad3a62
to
6efb3d7
Compare
c2ed3ce
to
e76e241
Compare
e39bae3
to
a414e36
Compare
I removed all the address stability detection from this PR. I will follow up on this separately. |
5b99eb5
to
5a92ac0
Compare
d592a4c
to
bff304d
Compare
Calling the < Function : _ZN3cub17CUB_200700_860_NS6detail8for_each13static_kernelINS2_12policy_hub_t12policy_350_tElNS1_9transform10dispatch_tILb0ElN4cuda3std3__45tupleIJN6thrust20THRUST_200700_860_NS6detail15normal_iteratorINSD_10device_ptrInEEEEEEESI_ZN53_INTERNAL_c645bd8d_15_babelstream1_cu_885ae364_1399973mulInlEEvRN7nvbench5stateENSM_9type_listIJT_T0_EEEEUlRKnE_12policy_hub_tIJSI_EEE28non_contiguous_fallback_op_tIJLm0EEEEEEvSR_T1_
---
> Function : _ZN3cub17CUB_200700_860_NS6detail9transform16transform_kernelIN12policy_hub_tIJN6thrust20THRUST_200700_860_NS6detail15normal_iteratorINS6_10device_ptrInEEEEEE10max_policyElZN53_INTERNAL_c645bd8d_15_babelstream1_cu_885ae364_1393953mulInlEEvRN7nvbench5stateENSG_9type_listIJT_T0_EEEEUlRKnE_SB_JPnEEEvSL_iT1_T2_DpNS2_10kernel_argIT3_EE
86c86
< /*00d0*/ IADD3 R4, P1, R2.reuse, c[0x0][0x170], RZ ; /* 0x00005c0002047a10 */
---
> /*00d0*/ IADD3 R4, P1, R2.reuse, c[0x0][0x188], RZ ; /* 0x0000620002047a10 */
88c88
< /*00e0*/ IADD3 R2, P2, R2, c[0x0][0x178], RZ ; /* 0x00005e0002027a10 */
---
> /*00e0*/ IADD3 R2, P2, R2, c[0x0][0x180], RZ ; /* 0x0000600002027a10 */
90c90
< /*00f0*/ IADD3.X R5, R0.reuse, c[0x0][0x174], RZ, P1, !PT ; /* 0x00005d0000057a10 */
---
> /*00f0*/ IADD3.X R5, R0.reuse, c[0x0][0x18c], RZ, P1, !PT ; /* 0x0000630000057a10 */
92c92
< /*0100*/ IADD3.X R3, R0, c[0x0][0x17c], RZ, P2, !PT ; /* 0x00005f0000037a10 */
---
> /*0100*/ IADD3.X R3, R0, c[0x0][0x184], RZ, P2, !PT ; /* 0x0000610000037a10 */
207c207 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incredible speedups on H200. Great job!
🟩 CI finished in 10h 16m: Pass: 100%/251 | Total: 1d 09h | Avg: 7m 57s | Max: 34m 05s | Hits: 99%/24441
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 251)
# | Runner |
---|---|
178 | linux-amd64-cpu16 |
42 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
This PR creates
cub::DeviceTransform
and makes the Thrust CUDA backend use it for.thrust::transform
The new
cub::DeviceTransform
algorithm comes with a set of kernels and corresponding algorithms:fallback_for
: reproduces the old behavior ofthrust::transform
and just forwards to the agent ofcub::DeviceFor
. This is needed for tuning and to compare against a baseline. We may want to remove thisbefore merging orin a separate PR, once we can cover all cases with better variants.prefetch
: does not bitwise-copy any data, but rather prefetches the addresses before starting to pull in data for transformation. this kernel also guarantees stable parameter address to function objects. this forms the new baseline for all GPUs if we cannot copy the parameters before passing them to the function objects (thus, the ultimate baseline kernel).unrolled_staged
(bikeshed name?): has two unrolled stages, once for loadingITEMS_PER_THREAD
elements from each input stream into registers, followed by a compute stage, that invokes the function object. Ensures the initial data fetch is large and is separate from writing, so the compiler does not detect any aliasing problems. this forms the new baseline for all GPUs if we can copy the parameters before passing them to the function objects.memcpy_async
: usesmemcpy_async
(requires Ampere) to start a bunch for memory loading transactions to shared memory for before crunching multiple iterations on shared memory. Register friendly algorithm and elements per threads are determined at runtime.ublkcp
: same as previous but usesublkcp
(requires Hopper). The setup is a bit more complex due to alignment.All kernelsThe ublkcp kernel derives their elements per thread from the target minimum number of bytes-in-flight (BIF).Onlyunrolled_staged
does this at compile-time though. However, the unrolling seems to be beneficial over prefetching, especially on the non-data center grade GPUs. From my current understanding, we need the 4 code paths/kernels.GH200 tuning top 10
CUB babelstream fallbackfor vs. ublkcp on H200
Fixes: #2091